Get 2017 MLB Salaries

The salaries in Salaries2017.csv were downloaded from USA Today at https://www.usatoday.com/sports/mlb/salaries/2017/player/all/.

  sal2017 = read.csv("Salaries2017.csv", header=TRUE)
  head(sal2017)
##   Rank             Name Team POS   Salary        Years Total.Value
## 1    1  Clayton Kershaw LAD   SP 33000000  7 (2014-20)   215000000
## 2    2     Zack Greinke ARI   SP 31876966  6 (2016-21)   206500000
## 3    3      David Price BOS   SP 30000000  7 (2016-22)   217000000
## 4    4   Miguel Cabrera DET   1B 28000000 10 (2014-23)   292000000
## 5    4 Justin Verlander DET   SP 28000000  7 (2013-19)   180000000
## 6    6    Jason Heyward CHC   RF 26055288  8 (2016-23)   184000000
##   Avg.Annual                                                        Source
## 1   30714286 https://www.usatoday.com/sports/mlb/salaries/2017/player/all/
## 2   34416666                                                     10/1/2017
## 3   31000000                                                              
## 4   29200000                                                              
## 5   25714285                                                              
## 6   23000000

Population Values

Because the file covers the whole population (2017 MLB players), we can compute numerical and graphical summaries of the population directly. We will focus on the annual salary for the year 2017.

  N = nrow(sal2017)
  summary(sal2017$Salary)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   535000   545500  1562500  4468069  6000000 33000000
  mean(sal2017$Salary)
## [1] 4468069
  sd(sal2017$Salary)
## [1] 5948459
  hist(sal2017$Salary)

  boxplot(sal2017$Salary)
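
One detail worth noting about the summary above: R's sd() divides by N - 1, which is the sample formula, while the standard deviation of a full population divides by N. The following is a minimal sketch of the difference. The vector x is a small hypothetical stand-in, not the real sal2017$Salary column.

```r
# Hypothetical stand-in values; NOT the real salary data
x <- c(535000, 545500, 1562500, 6000000, 33000000)
N <- length(x)

s_sample <- sd(x)                            # sd() divides by N - 1
s_pop    <- sqrt(sum((x - mean(x))^2) / N)   # population formula divides by N

# The two differ only by the factor sqrt((N - 1) / N)
all.equal(s_pop, s_sample * sqrt((N - 1) / N))
```

With N = 868 that factor is about 0.9994, so for this population the two versions differ by well under one tenth of a percent.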

Sample Statistics

Suppose that we now sample from the population as if we were unable to get information for all 868 players who were on the initial 2017 rosters. The statistics computed from a simple random sample should approximate the corresponding population values (parameters), although they will rarely match them exactly.

  n = 30
  smplIDs = sample(1:N,n)
  smpl = sal2017[sort(smplIDs),]
  head(smpl)
##     Rank         Name Team POS   Salary       Years Total.Value Avg.Annual
## 34    34   Mike Trout LAA   CF 20083333 6 (2015-20)   144500000   24083333
## 63    63 Jake Arrieta CHC   SP 15637500    1 (2017)    15637500   15637500
## 83    83   J.J. Hardy BAL   SS 13636781 3 (2015-17)    40000000   13333333
## 92    90 Josh Reddick HOU   RF 13000000 4 (2017-20)    52000000   13000000
## 111  104 Todd Frazier CWS   3B 12000000    1 (2017)    12000000   12000000
## 164  164  Brett Cecil STL   RP  7750000 4 (2017-20)    30500000    7625000
##     Source
## 34        
## 63        
## 83        
## 92        
## 111       
## 164
  summary(smpl$Salary)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   535000   549875  1090000  3851130  3750000 20083333
  mean(smpl$Salary)
## [1] 3851130
  sd(smpl$Salary)
## [1] 5400181
  hist(smpl$Salary)

  boxplot(smpl$Salary)

  smpl1 = smpl
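
Because sample() draws at random, rerunning the code above will select different players each time. If a reproducible draw is wanted, one option is to fix the random seed first. This is a minimal sketch; the seed value 2017 is an arbitrary choice for illustration and was not used to produce the output shown in this document.

```r
N <- 868   # population size stated in the text
n <- 30

set.seed(2017)            # hypothetical seed; any fixed integer works
ids_a <- sample(1:N, n)

set.seed(2017)            # resetting the seed replays the same draw
ids_b <- sample(1:N, n)

identical(ids_a, ids_b)   # both draws contain the same row indices
```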

If we repeat the process, it is unlikely that we will get the same values for the statistics. This is not surprising since it is unlikely that our new sample will contain the same 30 players that the original sample contained.

  n = 30
  smplIDs = sample(1:N,n)
  smpl = sal2017[sort(smplIDs),]
  head(smpl)
##     Rank         Name Team POS   Salary       Years Total.Value Avg.Annual
## 27    27  Chris Davis BAL   1B 21233006 7 (2016-22)   161000000   23000000
## 43    43 Hunter Pence  SF   RF 18700000 5 (2014-18)    90000000   18000000
## 83    83   J.J. Hardy BAL   SS 13636781 3 (2015-17)    40000000   13333333
## 84    84 Bryce Harper WSH   RF 13625000    1 (2017)    13625000   13625000
## 108  104   Chris Sale BOS   SP 12000000 5 (2013-17)    32500000    6500000
## 167  167   Dee Gordon MIA   2B  7742202 5 (2016-20)    50000000   10000000
##     Source
## 27        
## 43        
## 83        
## 84        
## 108       
## 167
  summary(smpl$Salary)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   535000   545250  1650000  4398326  4450000 21233006
  mean(smpl$Salary)
## [1] 4398326
  sd(smpl$Salary)
## [1] 5738254
  hist(smpl$Salary)

  boxplot(smpl$Salary)

  smpl2 = smpl

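As we suspected, the mean of the first sample (3.8511305 × 10^{6}) is not equal to the mean of the second sample (4.398326 × 10^{6}). The two values differ by about 547,000, and both fall within about 620,000 of the population mean (4.4680692 × 10^{6}), which is small relative to the population standard deviation of roughly 5.9 million.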
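
To see how much sample means vary in general, the draw can be repeated many times. The sketch below uses simulated right-skewed values as a stand-in for the salary column (an assumption, since the CSV is not bundled here) and compares the observed spread of 1000 sample means with the usual standard error formula including the finite population correction. The names pop, means, sigma, and se are introduced only for this illustration.

```r
# Simulated right-skewed "salaries"; a stand-in, NOT the USA Today data
set.seed(1)
pop <- 535000 + rexp(868, rate = 1 / 4e6)
N <- length(pop)
n <- 30

# Draw 1000 simple random samples of size n (without replacement)
# and keep the mean of each one
means <- replicate(1000, mean(sample(pop, n)))

mean(means)   # centre of the sample means, compare with mean(pop)
sd(means)     # observed spread of the sample means

# Standard error of the mean with the finite population correction
sigma <- sqrt(sum((pop - mean(pop))^2) / N)
se <- sigma / sqrt(n) * sqrt((N - n) / (N - 1))
se
```

Applying the same formula to the real population values reported earlier (standard deviation near 5.95 million, N = 868, n = 30) gives a standard error of roughly 1.07 million, so both sample means above lie within one standard error of the population mean. Calling hist(means) on the simulated draws would show the shape of the sampling distribution.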